Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript
Authors
Abstract
While statistical physics methods applied to large corpora have unveiled many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., one written in an unknown alphabet) is compatible with a natural language and, if so, to which language it could belong. The approach is based on three types of statistical measurements: those obtained from first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts in which the text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify how much each measurement depends on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidate keywords for the Voynich Manuscript, which could be helpful in the effort to decipher it. Because we were able to identify statistical measurements that depend more on syntax than on semantics, the framework may also serve for text analysis in language-dependent applications.
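As a rough illustration of the kind of measurements the abstract describes, the Python sketch below computes word degree in a word-adjacency network, a simple intermittency score, and a shuffled-text baseline. The function names (`adjacency_degrees`, `intermittency`, `shuffled_baseline`) are hypothetical, and the definitions are simplifying assumptions: links join adjacent words, and intermittency is taken as the coefficient of variation of the gaps between successive occurrences of a word. The paper's exact definitions may differ.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def adjacency_degrees(words):
    """Degree of each word in an undirected network linking adjacent words.

    Assumption: two words are connected if they appear next to each other
    at least once; self-links are ignored.
    """
    neighbors = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {word: len(linked) for word, linked in neighbors.items()}


def intermittency(words, target):
    """Coefficient of variation of the gaps between occurrences of `target`.

    Treats the text as a time series indexed by word position. Evenly
    spaced words score 0; words that cluster in bursts score higher.
    Returns None when there are fewer than two gaps to compare.
    """
    positions = [i for i, w in enumerate(words) if w == target]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None
    return pstdev(gaps) / mean(gaps)


def shuffled_baseline(words, measure, trials=100, seed=0):
    """Mean and standard deviation of `measure` over shuffled copies.

    Shuffling keeps word frequencies but destroys word order, so a value
    far from this baseline suggests structure beyond frequency alone.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        copy = list(words)
        rng.shuffle(copy)
        values.append(measure(copy))
    return mean(values), pstdev(values)


if __name__ == "__main__":
    # Toy text: "k" appears in a burst at the start and once at the end.
    text = ["k"] * 4 + ["o"] * 20 + ["k"]
    observed = intermittency(text, "k")
    baseline_mean, baseline_sd = shuffled_baseline(
        text, lambda ws: intermittency(ws, "k")
    )
    print("observed:", observed)
    print("shuffled mean and sd:", baseline_mean, baseline_sd)
```

Comparing the observed value with the shuffled mean and standard deviation gives a crude indication of whether a measurement reflects word order rather than word frequency alone, which is the role shuffled texts play in the abstract.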
Similar resources
Statistical Properties of European Languages and Voynich Manuscript Analysis
The statistical properties of letter frequencies in European literary texts are investigated. The logarithmic dependence of letter sequences is examined for one-language and two-language texts. A pair of languages is suggested for the Voynich Manuscript. The internal structure of the Manuscript is considered. Spectral portraits of the two-letter distribution are constructed.
Unsupervised Analysis of the Voynich Manuscript
The aim of this project is to research the possibilities of applying unsupervised learning techniques for natural language and other sequential data to undeciphered texts and manuscripts. The undeciphered text used is the Voynich Manuscript, a mysterious book from the 15th or 16th century that is written in an unknown script. Some methods that could be applied to manuscripts such as these will ...
Statistical Analysis of Unknown Written Language: The Voynich Manuscript
The Voynich Manuscript is a document written in an unknown language or cipher. This research proposal presents an approach to determining possible relationships within the Voynich, using known statistical methods from linguistics. The document reviews previous research carried out by other researchers. The proposed method is given and shows the current results obt...
Co-Occurrence Patterns in the Voynich Manuscript
The Voynich Manuscript is a medieval book written in an unknown script. This paper studies the distribution of similarly spelled words in the Voynich Manuscript. It shows that the distribution of words within the manuscript is not compatible with natural languages.
Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis
The Voynich manuscript has so far remained a mystery for linguists and cryptologists. While the text, written on medieval parchment in an unknown script system, shows basic statistical patterns that resemble those of real languages, some of its features have suggested to some researchers that the manuscript was a forgery intended as a hoax. Here we analyse the long-range structu...
Journal title:
Volume 8, issue
Pages -
Publication date: 2013